Author

Kevin Cisneros-Cuevas, Soren Fliegel, Chris Liu, Haoxian Liu

Code
library(tidyverse)
library(gganimate)
library(knitr)
library(gifski)
library(broom)
library(kableExtra)

Previous code from project proposal:

Code
life_expectancy <- read.csv("lex.csv")
gdp_per_capita <- read.csv("gdp_pcap.csv")

clean_life_expectancy <- life_expectancy |> 
  # calculating proportion of missing values for each row
  # source: https://www.statology.org/rowmeans-in-r/
  filter(rowMeans(across(-country, is.na)) <= 0.2) |> 
  # gsub: https://www.statology.org/gsub-r/
  # rename: https://dplyr.tidyverse.org/reference/rename.html
  rename_with(~ gsub("^X", "lex_", .), starts_with("X"))

convert_k <- function(vec){
  has_k <- str_detect(vec, "k")
  vec_numeric <- as.numeric(str_replace(vec, "k", "")) 
  return(vec_numeric * ifelse(has_k, 1000, 1))   
}

clean_gdp_per_capita <- gdp_per_capita |> 
  rename_with(~ gsub("^X", "gdp_per_capita_", .), starts_with("X")) |> 
  # all columns type -> numeric
  mutate(across(-country, convert_k))

long_life_expectancy <- clean_life_expectancy |> 
  pivot_longer(
    cols = starts_with("lex_"), 
    names_to = "year", 
    names_prefix = "lex_", 
    values_to = "life_expectancy"
  )

long_gdp_per_capita <- clean_gdp_per_capita |> 
  pivot_longer(
    cols = starts_with("gdp_per_capita_"), 
    names_to = "year", 
    names_prefix = "gdp_per_capita_", 
    values_to = "gdp_per_capita"
  )

clean_long_df <- long_life_expectancy |> 
  inner_join(long_gdp_per_capita, 
             by = c("country", "year")) 

2.1 Data Visualization

1. The relationship between two quantitative variables you are investigating:

Code
clean_long_df |>
  filter(year == 2017) |>
  ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
  geom_jitter(alpha = 0.5, size = 1) +
  labs(
    x = "GDP per Capita (2017 US$ PPP)",
    y = "",
    title = "GDP per Capita vs. Life Expectancy in 2017 Globally",
    subtitle = "Life Expectancy (in years)") + 
  theme_linedraw()

Above is a graph scatter plot depicting the relationship between GDP per Capita versus life expectancy for the year 2017 and in 2017 USD; each dot is a country. The year 2017 was chosen because of its recency and the fact that all currency was adjusted to that year. The two dots in the bottom left represent Lesotho and the Central African Republic, both of which were suffering from unrest in 2017. The dots furthest to the right represent Luxembourg, Qatar, and Singapore; while they have the highest wealth per citizen, they don’t necessarily have the longest life expectancy.

We see a positive relationship between the two variables. At first, as GDP per Capita increases, life expectancy increases dramatically. Then, as GDP per Capita reaches increases further, towards the bound of our graph, the positive relationship is less clear and there are diminishing returns in terms of life expectancy. This pattern is similar to the graph of a logarithm. This graph shows an intensely unequal world, with GDP per Capita varying greatly for countries of varying development and wealth. Additionally, we see an unequal spread of life expectancy even among wealthier countries, where wealth inequality and various healthcare systems likely come into play. Another reason for choosing a more recent year is our ability to see the full spread of the pattern from poor countries to very wealthy countries, as this inequality was not present to the same degree in 1800.

2. How this relationship (from #1) has changed over time:

Code
clean_long_df$year <- as.numeric(clean_long_df$year)

animated_over_time <- clean_long_df |>
  filter(year <= 2017) |>
  ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
  geom_jitter(alpha = 0.5, size = 1) +
  labs(
    x = "GDP per Capita (2017 US$ PPP)",
    y = "",
    title = "GDP per Capita vs. Life Expectancy: Year {round(frame_time)}",
    subtitle = "Life Expectancy (in years)") +
  theme_linedraw() +
  transition_time(year) +
  ease_aes("linear")

# Source: https://gganimate.com/
# should show in the "viewer" tab and also when rendered in html
animate(animated_over_time, renderer = gifski_renderer(), nframes = 150, duration = 20)

This animation shows the same graph from above (GDP per Capita vs Life Expectancy) over the years 1800-2017. The value here comes from our ability to see an increasing gap between various country’s wealth levels (and life expectancy as a result), but also in the overall life expectancy rising across the globe. This chart reinforces the idea presented above of ever-increasing wealth inequality globally, with some countries exploding in wealth and others staying nearly the same, all from relatively similar starting points in the year 1800. This animation also shows that the logarithmic pattern is followed with a medium strength.

2.2 Linear Regression

Before running the linear regression, summarising the data to one x value and one y value per country in the years 1939 to 1945 would be a great way to simplify the regression model and ensuring each country is represented consistently. We decided to investigate 1939 to 1945 because these are the years the Second World War occured.

The linear regression contains two variables of interest: average GDP per capitia for each country during the Second World War (explanatory variable: \(X\)) and average life expectancy for each country during the Second World War (response variable: \(Y\)).

Code
options(scipen = 999) # disable scientific notation!
summarised_lm <- clean_long_df |> 
  filter(year %in% 1939:1945) |>
  group_by(country) |> 
  summarize(
    avg_gdp_per_capita = mean(gdp_per_capita, na.rm = TRUE),
    avg_life_expectancy = mean(life_expectancy, na.rm = TRUE)
    ) |> 
  ungroup()

# Get Regression Model:
lm_model <- lm(avg_life_expectancy ~ avg_gdp_per_capita, 
               data = summarised_lm)

From the linear regression model, the population regression model is represented by this equation:

\[ Y = 35.321 + 0.0022X + \epsilon \]

Where \(\epsilon\) ~ \(\textit{N}(0, \sigma)\) accounts for random noise.

2.3 Model Fit

Code
summary_var <- lm_model |> 
  augment() |> 
  # we only need three columns
  select(avg_life_expectancy, .fitted, .resid) |> 
  pivot_longer(cols = c(avg_life_expectancy, .fitted, .resid),
               names_to = "variables",
               values_to = "values"
               ) |> 
  map_at("variables", as.factor) |> 
  bind_cols() |> 
  # change the variable names
  mutate(
    variables = fct_recode(variables, 
                          "Response" = "avg_life_expectancy",
                          "Fitted" = ".fitted",
                          "Residual" = ".resid")
  ) |> 
  # make table fancy
  group_by(variables) |> 
  summarize(variance = round(var(values), 2)) |> 
  kable(col.names = c("Variable Name", "𝛔̂²"),
        caption = "Table 1: Variability of the Regression Model",
        align = "c") |> 
  
  kable_classic(full_width = F, html_font = "Cambria")


summary_var
Table 1: Variability of the Regression Model
Variable Name 𝛔̂²
Fitted 43.29
Residual 51.17
Response 94.46

In Table 1, the estimated variances have been calculated for the predicted life expectancy (fitted values), the residuals, and the actual life expectancy (response values). First, the variance of the response values represents the total amount of variation in life expectancy across observations. Second, we have the variance of the fitted value, which captures how much of the variability in life expectancy is explained by GDP per capita. Third, is the residual variance, which represents the unexplained variability - that is, the portion of life expectancy variation that GDP per capita does not account for.

Assessing the Proportion of Variability Explained by the Model

To determine the proportion of the variability in the response values that was accounted in our model, we would first need to calculate the , which is done by doing this:

\[ R^2 = \frac{\hat{\sigma}^2_{\text{Fitted}}}{\hat{\sigma}^2_{\text{Response}}} \]

Code
R2 <- 43.29 / 94.46  

round(R2 * 100, 2)
[1] 45.83

Based on the result, our model explains about 45.83% of the variability in life expectancy using GDP per capita. This means that 54.17% of variability remains unexplained. With an of 45.83%, the quality of our model is moderate. This suggest that the model is useful but not highly predictive. Although it will give us an insight into the relationship between economic prosperity and life expectancy, it lacks other factors needed for a highly accurate prediction. In other words, GDP per capita is not the sole determining factor for a person’s life expectancy. There are other things to consider as well, such as healthcare access, education, environmental conditions, and government policies.

3.1 Visualizing Simulations from the Model

Code
set.seed(537) # reproducibility 

# function for adding noise
rand_error <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean = mean, sd = sd)
}

pred_life <- predict(lm_model)
est_sigma <- sigma(lm_model)
sim_response <- tibble(simulated_response = rand_error(pred_life, sd = est_sigma))

full_data <- summarised_lm |>
  select(avg_gdp_per_capita, avg_life_expectancy) |>
  bind_cols(sim_response) 

full_data |>
  ggplot(aes(x = avg_gdp_per_capita, y = simulated_response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Simulated Data GDP vs Life Expectancy, 1939 - 1945",
       x = "GDP Per Capita", 
       y = " ",
       subtitle = "Simulated Life Expectancy") +
  theme_linedraw() + 
  xlim(0, 20000) + 
  ylim(0, 100)
full_data |>
  ggplot(aes(x = avg_gdp_per_capita, y = avg_life_expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Observed GDP vs Life Expectancy, 1939 - 1945",
       x = "GDP per Capita", 
       y = "",
       subtitle = "Actual Life Expectancy") +
  theme_linedraw() + 
  xlim(0, 20000) + 
  ylim(0, 100)

Above is a side-by-side showcasing the relationship between GDP and life expectancy for the years 1939-1945. On the right is the actual life expectancy, and on the left is our simulated model. We can see that there are several key similarities and differences between the visualizations. In both, we can see that the positive relationship is maintained, although it is more clearly visible in our simulated data due to an outlier in what actually happened (the low dots on the right). The outlier, far below the predictive line, may have been experiencing unique economic conditions as a result of the war. Otherwise, the simulated data is visibly similar to the actual data, with a large clumping of countries from 0-5000 GDP per capita and life expectancy from 25-50. These countries were likely developing or heavily impacted by World War II, containing both life expectancy and GDP growth. As mentioned earlier, our model does a reasonable job of capturing the relationship between the two variables. However, other real-world events introduced complexities that it couldn’t perfectly account for.

Simulated Vs. Observed Relationship

Based on comparing the shape of our scatterplots, the simulated data looks quite similar to what was observed, indicating that our model is doing a pretty good job of predicting life expectancy. Now, let’s plot the relationship between the simulated and observed life expectancy for a closer look.

Code
full_data |> 
  ggplot(aes(x = simulated_response, 
             y = avg_life_expectancy)
         ) + 
  geom_point() + 
   labs(title = "Relationship between Simulated and Observed Life Expectancy (1939 - 1945)",
        x = "Simulated Life Expectancy (in years)", 
        y = "",
        subtitle = "Observed Life Expectancy (in years)" ) + 
  geom_abline(slope = 1,
              intercept = 0, 
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw()

If the simulated data were identical to the observed data, then the data points would be directly on the dashed blue line, indicating a perfect fit. Points that are above the dashed line is when the predicted life expectancy is greater than the actual observed life expectancy. The ones below the dashed line is when the predicted life expectancy is less than the actual observed life expectancy. And in this specific case, we could see that there are roughly the same amount of overestimates and underestimates. Since they are not extremely close to the line, we determine that there is a “moderate” relationship between the observed values and simulated values. There is still some unexplained variation, meaning other factors might be influencing life expectancy that the model does not capture. This makes sense since we have also previously found in the model fit statistics that our model only explains about 45.83% (R² ≈ 0.4583) of the variability in life expectancy using GDP per capita.

3.2 Generating Multiple Predictive Checks

Previously, we conducted a single simulation to compare our model’s predictions with the observed data. However, a single simulation only provides one possible set of predicted life expectancy which is not sufficient because in the real-world, data is influenced by random variation and uncertainty, so we would need to account for that as well. To better evaluate our model’s performance, we chose to generate 1500 simulated datasets. A 1500 simulations will provide a good balance between computational efficiency and statistical reliability. Too few simulations may produce an inaccurate representation of the model’s performance, while a much larger number would take a longer time and it might not result in any significant changes.

Regression of each Simulated on the Original

Now, once we have generated all 1500 simulations, all we have to do now is to go through an iterative process of fitting models for each of the simulated datasets. Doing this will allow us to assess the variability in predictions, determine whether our model systematically overestimates or underestimates the life expectancy, and evaluate its overall reliability. The goal is to ensure that our conclusion are not based on just a single outcome of the simulated data but instead account for different possible outcomes.

Extracting the R² Values

Now, for us to determine how the model performed across many simulated datasets, we will need to take a look at the R²value, which essentially tells us how “close” the simulated values and the observed values are. R²value ranges from 0-1, where 1 indicates that a model did a very good job, whereas 0 indicates that the estimated relationship is extremely weak or is not close at all.

Code
# regress each simulated on the original, and keep the R2 from each regression
sim_r_sq <- sims |> 
  map(~ lm(avg_life_expectancy ~ .x, data = sims)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)

sim_r_sq <- sim_r_sq[names(sim_r_sq) != "avg_life_expectancy"]

Plotting the Distribution of Simulated R² Values

Once we have extracted all of the R²values for each of the simulations, we can plot their distribution to observe the range in model performance. The distribution of these R²values will provide insight into how well our assumed model produce data similar to what was observed.

Code
tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models") +
  theme_bw()

Looking at the distribution, it seems like the simulated datasets have R² values between approximately 0.08 to 0.36. On average, our simulated data account for about 21.2 % of the variability in the observed life expectancy. Therefore, the data simulated under this statistical model are low to moderately similar to what was observed. The model captures some of the patterns in the data but leaves a significant portion of the variability unexplained. The fact that most R² values are relatively low (with no values approaching 1) indicates that the model is not doing a really strong job of explaining the variation in life expectancy. The outcome is not entirely surprising because life expectancy is influenced by multitudes of factors such as education, health, environment conditions, government policies, and much more. Relying solely on GDP per capita as a predictor is likely insufficient to capture all these influences. Overall, while our model generates data that reflects some trends in the observed life expectancy, its limited explanatory power suggests that additional explanatory variables are needed to better predict life expectancy.